Generalized Suffix Trees for Biological Sequence Data: Applications and Implementation

Authors

  • Paul Bieganski
  • John Riedl
  • John V. Carlis
  • Ernest F. Retzel
Abstract

This paper addresses applications of suffix trees and generalized suffix trees (GSTs) to biological sequence data analysis. We define a basic set of suffix tree and GST operations needed to support sequence data analysis. While those definitions are straightforward, the construction and manipulation of disk-based GST structures for large volumes of sequence data requires intricate design. GST processing is fast because the structure is content addressable, supporting efficient searches for all sequences that contain particular subsequences. Instead of laboriously searching sequences stored as arrays, we search by walking down the tree. We present a new GST-based sequence alignment algorithm, called GESTALT. GESTALT finds all exact matches in parallel, and uses best-first search to extend them to produce alignments. Our implementation experiences with applications using GST structures for sequence analysis lead us to conclude that GSTs are valuable tools for analyzing biological sequence data.
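To make the "search by walking down the tree" idea concrete, here is a minimal in-memory sketch in Python. It is not the paper's implementation: it builds an uncompressed suffix trie over several sequences with a naive quadratic-time, quadratic-space construction, whereas the paper describes disk-based GST structures. The names `Node`, `NaiveGST`, `add`, and `find` are hypothetical and chosen only for illustration.

```python
class Node:
    """One trie node. `hits` records every (sequence_id, start_offset)
    whose suffix passes through this node."""

    def __init__(self):
        self.children = {}
        self.hits = set()


class NaiveGST:
    """Illustrative generalized suffix trie (assumption: small inputs only).

    Every suffix of every added sequence is inserted character by
    character, so construction costs O(n^2) per sequence of length n.
    """

    def __init__(self):
        self.root = Node()
        self.sequences = []

    def add(self, seq):
        """Insert all suffixes of `seq`; return the id assigned to it."""
        seq_id = len(self.sequences)
        self.sequences.append(seq)
        for start in range(len(seq)):
            node = self.root
            for ch in seq[start:]:
                node = node.children.setdefault(ch, Node())
                node.hits.add((seq_id, start))
        return seq_id

    def find(self, pattern):
        """Walk down the tree along `pattern`.

        Returns a sorted list of (sequence_id, start_offset) pairs for
        every occurrence, or an empty list if the walk falls off the tree.
        """
        node = self.root
        for ch in pattern:
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(node.hits)


gst = NaiveGST()
gst.add("GATTACA")
gst.add("TACAGA")
print(gst.find("TACA"))  # occurrences across both sequences
```

The lookup cost depends on the length of the pattern rather than on the total amount of stored sequence data, which is the content-addressable property the abstract refers to. A practical GST would compress single-child paths into edges and use a linear-time construction.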

Similar resources

Suffix Trees (and Relatives) Come of Age in Bioinformatics

The book Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology [1] contains about 125 pages devoted to suffix trees, suffix arrays, and their applications in computational biology. A related data structure, the DAWG, is discussed via exercises. The book contains a wide range of applications of suffix trees, and while most have a biological “motivation”, at the ti...


Search-Optimized Persistent Suffix Tree Storage for Biological Applications

The suffix tree is a well known and popular indexing structure for various sequence processing problems arising in biological data management. However, unlike traditional indexing structures, suffix trees are orders of magnitude larger than the underlying data. Moreover, their construction and search algorithms are extremely inefficient when implemented directly on disk. Recently, we have shown...


Search-Optimized Suffix-Tree Storage for Biological Applications

Suffix-trees are popular indexing structures for various sequence processing problems in biological data management. We investigate here the possibility of enhancing the search efficiency of disk-resident suffix-trees through customized layouts of tree-nodes to disk-pages. Specifically, we propose a new layout strategy, called Stellar, that provides significantly improved search performance on ...


Applications of String Mining Techniques in Text Analysis

The focus of this project is on the algorithms and data structures used in string mining and their applications in bioinformatics, text mining and information retrieval. More specifically, it studies the use of suffix trees and suffix arrays for biological sequence analysis, and the algorithms used for approximate string matching, both general ones and specialized ones used in bioinformatics, like ...


Attack of the Mutant Suffix Trees

This is a thesis for the degree of filosofie licentiat (a Swedish degree between Master of Science and Ph.D.). It comprises three articles, all treating variations and augmentations of suffix trees, and the capability of the suffix tree data structure to efficiently capture similarities between different parts of a string. The presented applications are in the areas of data compression and patt...


Publication date: 1994